In this homework, we’ll be working on getting you set up with the tools you will need for this class. Once you are set up, we’ll do what we’re here to do: analyze data!
Here’s what we will accomplish by the end of the assignment:
We need two basic sets of tools for this class. We will need
R to analyze data. We will need RStudio to
help us interface with R and to produce documentation of our
results.
R is going to be the only programming language we will use. R is an extensible statistical programming environment that can handle all of the main tasks that we’ll need to cover this semester: getting data, analyzing data and communicating data analysis.
If you haven’t already, you need to download R here: https://cran.r-project.org/.
When we work with R, we communicate via the command line. To help automate this process, we can write scripts, which contain all of the commands to be executed. These scripts generate various kinds of output, like numbers on the screen, graphics or reports in common formats (PDF, Word). Most programming languages have several Integrated Development Environments (IDEs) that encompass all of these elements (scripts, command line interface, output). The primary IDE for R is RStudio.
If you haven’t already, you need to download RStudio here: https://rstudio.com/products/rstudio/download/. You need the free RStudio desktop version.
In each class, we’re going to include some code and text in one file,
and data in another file. You’ll need to download both of these files to
your computer. You need to have a particular place to put these files.
Computers are organized using named directories (sometimes called
folders). Don’t just put the files in your Downloads directory. One
common solution is to create a directory on your computer named after
the class: psc_4175. Each time you download the files, you’ll
want to place them in that directory.
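If you would rather create that directory from R than by hand, here is a minimal sketch. The location (a psc_4175 folder inside your home directory) is only an assumption for illustration; put the folder wherever suits you.

```r
# Hypothetical location: a psc_4175 folder inside your home directory.
class_dir <- file.path(path.expand("~"), "psc_4175")

# Create the folder only if it does not already exist.
if (!dir.exists(class_dir)) {
  dir.create(class_dir, recursive = TRUE)
}

# Check that the folder is there and see what is in it.
dir.exists(class_dir)
list.files(class_dir)
```

Once the folder exists, save each class's code and data files into it.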
We’re going to grab some data that’s part of the college scorecard and do a bit of analysis on it.
Open RStudio, then create a new .Rmd file.
To do this, click on File → New File →
R Markdown....
You will then be asked to determine a bunch of settings for this
.Rmd document. For example, you can choose whether you want
to create a “Document”, “Presentation”, “Shiny”, or “From Template” on
the left. You can set the “Title:” “Author:” and “Date:” on the
top-right. And you can choose the “Default Output Format:” to be either
“HTML”, “PDF”, or “Word”. You should not change any of these
settings. Their defaults (“Document”, “Untitled”, “[Your
name]”, “[Today’s Date]”, and “HTML”) are sufficient. Just click
“OK”.
Copy the raw code from the psc4175_hw_1.Rmd
file by clicking on the copy button as shown in the image below.
Finally, replace the default code in your R Markdown file with the code you copied from GitHub!
If viewing this as an html file, you can view this gif for more help!
.Rmd files will be the only file format we work with in this class. .Rmd files contain three basic elements:
From a .Rmd file you can generate html documents, pdf documents, word documents, slides . . . lots of stuff. All class notes will be in .Rmd. Most assignments will be turned in as .Rmd files, and the guided exercise we’ll have you do? You guessed it, .Rmd.
In the .Rmd file you’ll notice that there are three backticks in a row, like so: ``` This indicates the
start of a “code chunk” in our file. The first code chunk that we load
will include a set of programs that we will need all semester long.
I like to see results in the Console. By default RStudio will output results from an .Rmd file inline, meaning in the document itself. To change this, go to Tools → Global Options → R Markdown, and uncheck the box for “Show output inline for all R Markdown documents.”
When we say that R is extensible, we mean that people in the
community can write programs that everyone else can use. These are
called “packages.” In these first few lines of code, I load a set of
packages using the library command in R. The set of packages, called
tidyverse, was written by Hadley Wickham and others and
plays a key role in his book. To install this set of packages, simply
type in install.packages("tidyverse") at the R command
prompt. Alternatively, you can use the “Packages” pane in the lower
right hand corner of your RStudio screen. Click on Packages, then click
on Install, then type in “tidyverse.”
To run the code below in R, you can:
- press CMD+RETURN (Mac)
- press CTRL+RETURN (Windows)
## Get necessary libraries-- won't work the first time, because you need to install them!
# install.packages("tidyverse") # Uncomment this to install
library(tidyverse)
Here’s the thing about packages. There’s a difference between installing a package and calling a package. Installing means that the package is on your computer and available to use. Calling a package means that the commands in the package will be used in this session. A “session” is basically when R has been opened up on your computer. As long as R/RStudio are open and running, the session is active.
It’s a good practice to shut down R/RStudio once you’re no longer working in it, and then to restart it when you begin working again. Otherwise, the working environment can get pretty crowded with data and packages.
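To make the installed-versus-called distinction concrete, here is a small sketch that checks both for tidyverse. The object names is_installed and is_attached are made up for this illustration.

```r
# Installed? requireNamespace() returns TRUE if the package is on this computer.
is_installed <- requireNamespace("tidyverse", quietly = TRUE)

# Called in this session? Packages attached with library() show up in search().
is_attached <- "package:tidyverse" %in% search()

is_installed
is_attached
```

In a fresh session, before library(tidyverse) has been run, the second check should come back FALSE even when the first is TRUE.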
Now we’re ready to load in data. The data frame will be our basic way
of interacting with everything in this class. The
sc_debt.Rds (found here: https://github.com/rweldzius/PSC4175_F2024/blob/main/Data/sc_debt.Rds)
data frame contains information from the college scorecard on different
colleges and universities.
tidyverse includes a read_rds() function
that can read data directly from the internet.
## Use the raw file link; the /blob/ link is a web page, not the .Rds file itself
df <- read_rds('https://github.com/rweldzius/PSC4175_F2024/raw/main/Data/sc_debt.Rds')
You’ll notice that the code above starts with df. This
is just an arbitrary name for an object. You could name it
dat or raw or debt or whatever
you want. Then there’s an arrow <-. This is an
assignment operator. Then there’s a function, read_rds, with
parentheses, and an argument: the location of the file sc_debt.Rds. Here’s how to think about
this.
read_rds opens a type of data: rds data. This
function has one argument, which is the location of the file I want to
open. So the command above says “use read_rds to open the
file sc_debt.Rds and assign the result to the object
df.”
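Here is a small sketch of the same pattern (a name, an arrow, a function, an argument) that runs without an internet connection. It uses base R's saveRDS and readRDS, which read_rds wraps; the data frame toy and the temporary file are made up for illustration and are not the course data.

```r
# A small made-up data frame (not the course data).
toy <- data.frame(school = c("A", "B"), adm_rate = c(0.25, 0.60))

# Save it to a temporary .Rds file.
tmp <- tempfile(fileext = ".Rds")
saveRDS(toy, tmp)

# Function name, parentheses, one argument (the file location);
# the arrow assigns the result to an object named dat.
dat <- readRDS(tmp)

# The object we read back matches the one we saved.
identical(dat, toy)  # TRUE
```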
Let’s take a quick look at the object df
df
This is just the first part of the data frame. All data frames have
the exact same structure. Each row is a case. In this example, each row
is a college. Each column is a characteristic of the case, what we call
a variable. Let’s use the names command to see what
variables are in the dataset.
names(df)
It’s hard to know what these mean without some more information. We
usually use a codebook to get more information about a dataset. Because
we use very short names for variables, it’s useful to have some more
information (fancy name: metadata) that tells us about those variables.
Below you’ll see the R name for each variable next to a
description of each variable.
| Name | Definition |
|---|---|
| unitid | Unit ID |
| instnm | Institution Name |
| stabbr | State Abbreviation |
| grad_debt_mdn | Median Debt of Graduates |
| control | Control Public or Private |
| region | Census Region |
| preddeg | Predominant Degree Offered: Associates or Bachelors |
| openadmp | Open Admissions Policy: 1=Yes, 2=No, 3=No 1st time students |
| adm_rate | Admissions Rate: proportion of applications accepted |
| ccbasic | Type of institution– see here |
| selective | Institution admits fewer than 10% of applicants: 1=Yes, 0=No |
| research_u | Institution is a research university: 1=Yes, 0=No |
| sat_avg | Average SAT Scores |
| md_earn_wne_p6 | Average Earnings of Recent Graduates |
| ugds | Number of undergraduates |
| costt4a | Average cost of attendance (tuition-grants) |
We can also look at the whole dataset using View. Just delete the
# sign below to make the code work. That #
sign is a comment in R code, which indicates to the computer that
everything on that line should be ignored. To get it to run, we need to
drop the #.
#View(df)
You’ll notice that this data is arranged in a rectangular format, with each row showing a different college, and each column representing a different characteristic of that college. Datasets are always structured this way: cases (or units) form the rows, and the characteristics of those cases, or variables, form the columns. Unlike working with spreadsheets, this structure is always assumed for datasets.
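To see the rows-are-cases, columns-are-variables idea on something small, here is a sketch with a made-up three-college data frame. The colleges and their values are invented for illustration; only the variable names are borrowed from the codebook.

```r
# Three made-up colleges (rows) and three variables (columns).
toy <- data.frame(
  instnm   = c("College A", "College B", "College C"),
  adm_rate = c(0.08, 0.45, 0.72),
  sat_avg  = c(1480, 1150, NA)
)

nrow(toy)   # 3 cases
ncol(toy)   # 3 variables
names(toy)  # "instnm" "adm_rate" "sat_avg"
```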
In exploring data, many times we want to look at smaller parts of the dataset. There are three commands we’ll use today that help with this.
- filter selects only those cases or rows that meet some logical criteria.
- select selects only those variables or columns that meet some criteria.
- arrange arranges the rows of a dataset in the way we want.
For more on these, please see this vignette.
Let’s grab just the data for Villanova, then look only at the average test scores and admit rate. We can use filter to look at all of the variables for Villanova:
df %>%
  filter(instnm == "Villanova University")
What’s that weird looking %>% thing? That’s called a
pipe. This is how we chain commands together in R. Think of it as saying
“and then” to R. In the above case, we said, take the data and
then filter it to be just the data where the institution name is
Villanova University.
The command above says the following:
Take the dataframe df and then filter it to
just those cases where instnm is equal to “Villanova
University.” Notice the “double equals” sign, that’s a logical operator
asking if instnm is equal to “Villanova University.”
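One way to see what the pipe does is to write the same command with and without it. This is a sketch on a made-up tibble (the colleges and numbers are invented), and it loads dplyr, one of the tidyverse packages.

```r
library(dplyr)

# Made-up data for illustration only.
toy <- tibble(
  instnm   = c("College A", "College B", "College C"),
  adm_rate = c(0.08, 0.45, 0.72)
)

# With the pipe: take toy, and then filter it.
piped <- toy %>%
  filter(instnm == "College B")

# Without the pipe: toy is simply the first argument to filter().
nested <- filter(toy, instnm == "College B")

# Both approaches return the same single row.
identical(piped, nested)
```

The pipe earns its keep once several steps are chained, because each step reads top to bottom instead of inside out.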
Many times, though, we don’t want to see everything; we just want to
choose a few variables. select allows us to select only the
variables we want. In this case, the institution name, its admit rate,
and the average SAT scores of entering students.
df %>%
  filter(instnm == "Villanova University") %>%
  select(instnm, adm_rate, sat_avg)
filter takes logical tests as its argument. The code
instnm=="Villanova University" is a logical statement that
will be true of just one case in the dataset: when institution name is
Villanova University. The == is a logical test, asking if
this is equal to that. Other common logical and relational operators for
R include
- >, <: greater than, less than
- >=, <=: greater than or equal to, less than or equal to
- !: not, as in != not equal to
- &: AND
- |: OR
Next, we can use filter to look at colleges with low
admissions rates, say less than 10% (or .1 in the proportion scale used
in the dataset).
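To get a feel for these operators before using them inside filter, here is a short sketch on a made-up vector of three admission rates. Each test returns TRUE or FALSE for each value, which is how filter decides which rows to keep.

```r
# Three made-up admission rates.
adm_rate <- c(0.05, 0.25, 0.60)

adm_rate < 0.1                     # TRUE FALSE FALSE
adm_rate >= 0.25                   # FALSE TRUE TRUE
adm_rate != 0.25                   # TRUE FALSE TRUE
adm_rate > 0.2 & adm_rate < 0.3    # FALSE TRUE FALSE  (both must hold)
adm_rate < 0.1 | adm_rate > 0.5    # TRUE FALSE TRUE   (either may hold)
!(adm_rate < 0.1)                  # FALSE TRUE TRUE   (flips the test)
```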
df %>%
  filter(adm_rate < .1) %>%
  select(instnm, adm_rate, sat_avg) %>%
  arrange(sat_avg, adm_rate) %>%
  print(n = 20)
Now let’s look at colleges with low admit rates, and order them using
arrange by SAT scores (-sat_avg gives
descending order).
df %>%
  filter(adm_rate < .1) %>%
  select(instnm, adm_rate, sat_avg) %>%
  arrange(-sat_avg)
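The minus sign is one way to sort in descending order; dplyr also provides desc(), which does the same thing for numbers and additionally works on text. A sketch on made-up data (the colleges and scores are invented):

```r
library(dplyr)

# Made-up data for illustration only.
toy <- tibble(
  instnm  = c("College A", "College B", "College C"),
  sat_avg = c(1150, 1480, 1300)
)

# Two ways to sort from highest to lowest SAT average.
by_minus <- toy %>% arrange(-sat_avg)
by_desc  <- toy %>% arrange(desc(sat_avg))

by_minus$instnm  # "College B" "College C" "College A"

# desc() also handles text columns, where a minus sign would not work.
by_name <- toy %>% arrange(desc(instnm))
by_name$instnm   # "College C" "College B" "College A"
```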
And one last operation: all colleges that admit between 20 and 30 percent of students, looking at their SAT scores, median debt of graduates, and what state they are in, then arranging by state, and then SAT score.
df %>%
  filter(adm_rate > .2 & adm_rate < .3) %>%
  select(instnm, sat_avg, grad_debt_mdn, stabbr) %>%
  arrange(stabbr, -sat_avg) %>%
  print(n = 20)
Quick Exercise Choose a different college and two different things about that college. Have R print the output.
# INSERT CODE HERE
To summarize data, we use the summarize command. Inside
that command, we tell R two things: what to call the new variable that
we’re creating, and what numerical summary we would like. The code below
summarizes median debt for the colleges in the dataset by calculating
the average of median debt for all institutions.
df %>%
  summarize(mean_debt = mean(grad_debt_mdn, na.rm = TRUE))
df %>%
  summarize(median_debt = median(grad_debt_mdn, na.rm = TRUE))
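The na.rm=TRUE part of those commands tells R to drop missing values before calculating. Here is a sketch of why it matters, using a made-up debt column with one missing value (the numbers are invented):

```r
library(dplyr)

# Made-up data: two known debt values and one missing value.
toy <- tibble(grad_debt_mdn = c(10000, 20000, NA))

# Without na.rm, a single missing value makes the whole result missing.
mean(toy$grad_debt_mdn)                # NA

# With na.rm = TRUE, the missing value is dropped first.
mean(toy$grad_debt_mdn, na.rm = TRUE)  # 15000

# summarize can create several named summaries at once.
toy_summary <- toy %>%
  summarize(
    mean_debt   = mean(grad_debt_mdn, na.rm = TRUE),
    median_debt = median(grad_debt_mdn, na.rm = TRUE)
  )
toy_summary
```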
Quick Exercise Summarize the average entering SAT scores in this dataset.
# INSERT CODE HERE
We can also combine commands, so that summaries are done on only a part of the dataset. Below, we summarize median debt for selective schools, and not very selective schools.
df %>%
  filter(adm_rate < .1) %>%
  summarize(mean_debt = mean(grad_debt_mdn, na.rm = TRUE))
What about for not very selective schools?
df %>%
  filter(adm_rate > .3) %>%
  summarize(mean_debt = mean(grad_debt_mdn, na.rm = TRUE))
Quick Exercise Calculate average earnings for schools where SAT>1200
# INSERT CODE HERE
Quick Exercise Calculate the average debt for schools that admit over 50% of the students who apply.
# INSERT CODE HERE